Ten Pairs to Tag - Multilingual POS Tagging via Coarse Mapping between Embeddings

نویسندگان

  • Yuan Zhang
  • David Gaddy
  • Regina Barzilay
  • Tommi S. Jaakkola
چکیده

In the absence of annotations in the target language, multilingual models typically draw on extensive parallel resources. In this paper, we demonstrate that accurate multilingual partof-speech (POS) tagging can be done with just a few (e.g., ten) word translation pairs. We use the translation pairs to establish a coarse linear isometric (orthonormal) mapping between monolingual embeddings. This enables the supervised source model expressed in terms of embeddings to be used directly on the target language. We further refine the model in an unsupervised manner by initializing and regularizing it to be close to the direct transfer model. Averaged across six languages, our model yields a 37.5% absolute improvement over the monolingual prototypedriven method (Haghighi and Klein, 2006) when using a comparable amount of supervision. Moreover, to highlight key linguistic characteristics of the generated tags, we use them to predict typological properties of languages, obtaining a 50% error reduction relative to the prototype model.1

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

An improved joint model: POS tagging and dependency parsing

Dependency parsing is a way of syntactic parsing and a natural language that automatically analyzes the dependency structure of sentences, and the input for each sentence creates a dependency graph. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...

متن کامل

Unsupervised Multilingual Learning for POS Tagging

We demonstrate the effectiveness of multilingual learning for unsupervised part-of-speech tagging. The key hypothesis of multilingual learning is that by combining cues from multiple languages, the structure of each becomes more apparent. We formulate a hierarchical Bayesian model for jointly predicting bilingual streams of part-of-speech tags. The model learns language-specific features while ...

متن کامل

Character Embeddings PoS Tagger vs HMM Tagger for Tweets

English. The paper describes our submissions to the task on PoS tagging for Italian Social Media Texts (PoSTWITA) at Evalita 2016. We compared two approaches: a traditional HMM trigram Pos tagger and a Deep Learning PoS tagger using both character-level and word-level embeddings. The character-level embeddings performed better proving that they can provide a finer representation of words that a...

متن کامل

معرفی رویکردی ماشینی با استفاده از الگوریتم لسک و برچسبدهی نحوی جهت رفع ابهام از معنای کلمات

The present study introduces a machine-based approach for word sense disambiguation (WSD). In Persian, a morphologically complex language, POS tag which lots of homographs are made, one way for doing WSD is allocating the right Part Of Speech (POS) tags to words prior to WSD. Since the frequency of noun and adjective homographs in different Persian POS tag text corpuses is high, POS tag disambi...

متن کامل

Part-of-speech Tagging of Code-mixed Social Media Content: Pipeline, Stacking and Joint Modelling

Multilingual users of social media sometimes use multiple languages during conversation. Mixing multiple languages in content is known as code-mixing. We annotate a subset of a trilingual code-mixed corpus (Barman et al., 2014) with part-of-speech (POS) tags. We investigate two state-of-the-art POS tagging techniques for code-mixed content and combine the features of the two systems to build a ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016